Report for W911NF1210404: High Dimensional Learning


Motivation 

Today  we  are  facing  a  “data  deluge”  in  almost  every  domain.  Online  social  networks  have  seen  an 
explosion  in  activity  and  have  fundamentally  transformed  the  nature  of  human  interaction.  In  the 
biological  realm,  modern  genome  sequencers  can  output  data  at  a  rate  400  times  faster  than  the 
ones  a  decade  ago,  and  so  on.  However,  although  having  a  transformative  potential,  the  data  deluge 
has  not  yet  been  exploited  to  the  fullest  extent.  Ironically,  the  data  deluge  has  also  resulted  in  a 
“data  desert”.  The  collected  data  in  many  domains  are  noisy,  subsampled,  with  typically  a  large 
number  of  variables  or  “unknowns”  compared  to  the  number  of  observations  or  the  “knowns”. 
Such high dimensionality calls for practical, principled approaches to learning from ill-posed and
ill-behaved data.

Some  of  the  fundamental  questions  in  high-dimensional  learning  are:  Can  we  design  scalable 
models  for  efficiently  representing  and  learning  high-dimensional  data?  Here,  scalability  refers 
to  low  computational  requirements  and  reduced  sampling  of  high-dimensional  data.  Not  all 
phenomena  can  be  learnt  in  a  scalable  manner.  Can  we  characterize  the  fundamental  limits 
on the complexity of learning complex phenomena? As part of this project, the PI has tackled the
above challenges by exploiting “inherent data architecture”. This can be in the form of structural
relationships among the variables, represented as graphs, or as parametric forms, represented
as tensor decompositions. The PI has developed novel approaches for handling such
high-dimensional data.

1  Summary  of  Results:  Tensor  Approaches  for  Learning  Latent 
Variable  Models 

Mixture Models: Classically, latent variables have been incorporated via mixture models. A
mixture model can be thought of as selecting the distribution of the observed variables based on
a so-called latent choice variable. Gaussian mixtures are the most well-studied class of mixture
models. Recently, the class of exchangeable topic models, such as latent Dirichlet
allocation, has become popular for modeling large word corpora [1]. These models incorporate
documents with multiple hidden topics. We propose efficient methods for learning these popular
mixture models.

Challenges: Learning general latent variable models through maximum likelihood is NP-hard.
Previous methods with theoretical consistency guarantees have high computational and sample
complexity, which typically scale exponentially with the latent space dimensionality. The current
practice for estimating latent variable models is mostly through local search heuristics (e.g., the
EM algorithm), which are prone to failure in high dimensions.
Spectral Approach to Inverse Moment Methods: The method of moments presents a
powerful alternative to EM and other heuristics. The basic paradigm of the method of moments [2]
is to: (i) compute certain statistics of the data, often empirical moments such as means and
correlations, and (ii) find model parameters that give rise to (nearly) these moments. The
second step, solving equations to obtain the parameters, can typically be reduced to operations on
the “spectrum” of matrices and tensors obtained from the moments. Although these problems are
non-convex, they admit efficient iterative methods for finding the solutions.

Single Topic Exchangeable Model: Consider a simple bag-of-words model for documents
in which the words in the document are assumed to be exchangeable. Recall that a collection of
random variables $x_1, x_2, \ldots, x_\ell$ is exchangeable if its joint probability distribution is invariant
to permutation of the indices. The well-known De Finetti’s theorem [3] implies that such
exchangeable models can be viewed as mixture models in which there is a latent variable $h$ such that
$x_1, x_2, \ldots, x_\ell$ are conditionally i.i.d. given $h$ (see Figure 1 for the corresponding graphical model)
and the conditional distributions are identical at all the nodes.

In our simplified topic model for documents, the latent variable $h$ is interpreted as the (sole)
topic of a given document, and it is assumed to take only a finite number of distinct values. Let
$k$ be the number of distinct topics in the corpus, $d$ be the number of distinct words in the
vocabulary, and $\ell \ge 3$ be the number of words in each document. The generative process for a
document is as follows: the document’s topic is drawn according to the discrete distribution
specified by the probability vector $w := (w_1, w_2, \ldots, w_k) \in \Delta^{k-1}$. This is modeled as a
discrete random variable $h$ such that $\Pr[h = j] = w_j$, for $j \in [k]$. Given the topic $h$, the
document’s $\ell$ words are drawn independently according to the discrete distribution specified by the
probability vector $\mu_h \in \Delta^{d-1}$. It will be convenient to represent the $\ell$ words in the document by
$d$-dimensional random vectors $x_1, x_2, \ldots, x_\ell \in \mathbb{R}^d$. Specifically, we set

$$x_t = e_i \quad \text{if and only if the $t$-th word in the document is $i$}, \quad t \in [\ell],$$

where $e_1, e_2, \ldots, e_d$ is the standard coordinate basis for $\mathbb{R}^d$. Because the words are conditionally
independent given the topic, we can use this same property with conditional cross moments, say,
of $x_1$ and $x_2$:


Figure 1: Exchangeable Topic Models.


$$\mathbb{E}[x_1 \otimes x_2 \mid h = j] = \mathbb{E}[x_1 \mid h = j] \otimes \mathbb{E}[x_2 \mid h = j] = \mu_j \otimes \mu_j, \quad j \in [k].$$

This and similar calculations lead one to the following results: If $M_2 := \mathbb{E}[x_1 \otimes x_2]$ and
$M_3 := \mathbb{E}[x_1 \otimes x_2 \otimes x_3]$, then

$$M_2 = \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i, \qquad M_3 = \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i \otimes \mu_i.$$

In [4], the PI establishes that for many classes of latent variable models, including spherical
Gaussian mixtures, latent Dirichlet allocation and hidden Markov models, using low-order moments
(typically first-, second- and third-order), we can obtain a symmetric tensor form as above. So the
problem of parameter estimation reduces to finding the components of the tensor.
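The moment identities for the single topic model can be checked numerically. The following is a minimal sketch, assuming NumPy and a toy model with $k = 2$ topics and $d = 3$ words; the parameter values and the helper names `population_moments` and `empirical_moments` are illustrative and not taken from the report.

```python
import numpy as np

# Toy single-topic model; every number below is an illustrative assumption.
w = np.array([0.4, 0.6])              # topic probabilities w_j, k = 2
mu = np.array([[0.7, 0.2, 0.1],       # mu[j] = word distribution of topic j
               [0.1, 0.3, 0.6]])      # vocabulary size d = 3


def population_moments(w, mu):
    """M2 = sum_j w_j mu_j (x) mu_j and M3 = sum_j w_j mu_j (x) mu_j (x) mu_j."""
    M2 = np.einsum("j,ja,jb->ab", w, mu, mu)
    M3 = np.einsum("j,ja,jb,jc->abc", w, mu, mu, mu)
    return M2, M3


def empirical_moments(w, mu, n_docs, seed=0):
    """Estimate M2 and M3 from n_docs sampled documents of three words each."""
    rng = np.random.default_rng(seed)
    k, d = mu.shape
    topics = rng.choice(k, size=n_docs, p=w)          # one topic h per document
    # Inverse-CDF sampling: three conditionally i.i.d. words given the topic.
    cdf = np.cumsum(mu, axis=1)[topics]               # shape (n_docs, d)
    u = rng.random((n_docs, 3))
    words = (u[:, :, None] > cdf[:, None, :]).sum(axis=2)
    words = np.minimum(words, d - 1)                  # guard against rounding
    x = np.eye(d)[words]                              # one-hot encoding x_t = e_i
    M2 = np.einsum("na,nb->ab", x[:, 0], x[:, 1]) / n_docs
    M3 = np.einsum("na,nb,nc->abc", x[:, 0], x[:, 1], x[:, 2]) / n_docs
    return M2, M3


M2, M3 = population_moments(w, mu)
M2_hat, M3_hat = empirical_moments(w, mu, n_docs=200_000)
print(np.abs(M2 - M2_hat).max(), np.abs(M3 - M3_hat).max())
```

With the one-hot encoding, each entry of $M_2$ and $M_3$ is a joint probability of two or three words, so both arrays sum to one, and the empirical estimates should approach them as the number of documents grows.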

Reduction to Orthogonal Symmetric Tensors: While general tensor decomposition is NP-hard,
the PI establishes that the symmetric tensor decomposition can be reduced to an orthogonal
symmetric decomposition given the moments as above, when the number of topics $k \le d$, where
$d$ is the dimension of the observed space (i.e., the vocabulary size for topic models). Additionally,
we require non-degeneracy: the vectors $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ are linearly independent, and the
scalars $w_1, w_2, \ldots, w_k > 0$ are strictly positive.

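Now, let $W \in \mathbb{R}^{d \times k}$ be a linear transformation such that $M_2(W, W) = W^\top M_2 W = I$, i.e., $W$
whitens $M_2$. Let $\tilde{\mu}_i := \sqrt{w_i}\, W^\top \mu_i$. Observe that $M_2(W, W) = \sum_{i=1}^{k} \tilde{\mu}_i \tilde{\mu}_i^\top = I$, so the
$\tilde{\mu}_i \in \mathbb{R}^k$ are orthonormal vectors. Now define $\tilde{M}_3 := M_3(W, W, W) \in \mathbb{R}^{k \times k \times k}$, so that

$$\tilde{M}_3 = \sum_{i=1}^{k} w_i\, (W^\top \mu_i)^{\otimes 3} = \sum_{i=1}^{k} \frac{1}{\sqrt{w_i}}\, \tilde{\mu}_i^{\otimes 3}$$

is an orthogonal symmetric tensor.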
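The whitening step can be sketched in a few lines. This is an illustration only, assuming NumPy, exact population moments, and toy parameters chosen for the example; building $W$ from the top-$k$ eigenpairs of $M_2$ is one standard choice of whitening matrix, and the function name `whiten` is hypothetical.

```python
import numpy as np

# Toy parameters, assumed for illustration (k = 2 topics, d = 3 words).
w = np.array([0.4, 0.6])
mu = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.3, 0.6]])
M2 = np.einsum("j,ja,jb->ab", w, mu, mu)
M3 = np.einsum("j,ja,jb,jc->abc", w, mu, mu, mu)


def whiten(M2, M3, k):
    """Return W (d x k) with W^T M2 W = I_k and the whitened tensor M3(W, W, W)."""
    eigvals, eigvecs = np.linalg.eigh(M2)           # eigenvalues in ascending order
    U, s = eigvecs[:, -k:], eigvals[-k:]            # top-k eigenpairs of M2
    W = U / np.sqrt(s)                              # W = U diag(s)^(-1/2)
    M3_w = np.einsum("abc,ai,bj,cl->ijl", M3, W, W, W)
    return W, M3_w


W, M3_w = whiten(M2, M3, k=2)
mu_tilde = np.sqrt(w)[:, None] * (mu @ W)           # rows: sqrt(w_i) W^T mu_i
rebuilt = np.einsum("j,ja,jb,jc->abc", 1 / np.sqrt(w), mu_tilde, mu_tilde, mu_tilde)
print(np.round(mu_tilde @ mu_tilde.T, 6))           # Gram matrix of the mu_tilde_i
```

The Gram matrix of the whitened vectors should be the identity, and `rebuilt` should agree with `M3_w`, matching the orthogonal form $\sum_i w_i^{-1/2}\, \tilde{\mu}_i^{\otimes 3}$.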

Tensor Power Iterations: Efficient Methods for Tensor Decomposition: The orthogonal
tensor decomposition encountered in these models can be efficiently solved through a simple power
iteration method. For a tensor $T$, consider the vector-valued map

$$u \mapsto T(I, u, u). \tag{1}$$

This can be explicitly written as $T(I, u, u) = \sum_{i} \sum_{j, l} T_{i,j,l}\, (e_j^\top u)(e_l^\top u)\, e_i$. Observe that (1) is
not a linear map, which is a key difference compared to the matrix case.

An eigenvector $u$ of a matrix $M$ satisfies $M(I, u) = Mu = \lambda u$ for some scalar $\lambda$. We say a unit
vector $u \in \mathbb{R}^k$ is an eigenvector of $T$, with corresponding eigenvalue $\lambda \in \mathbb{R}$, if $T(I, u, u) = \lambda u$. For
orthogonally decomposable tensors $T = \sum_{i=1}^{k} \lambda_i v_i^{\otimes 3}$,

$$T(I, u, u) = \sum_{i=1}^{k} \lambda_i\, (u^\top v_i)^2\, v_i.$$

By the orthogonality of the $v_i$, it is clear that $T(I, v_i, v_i) = \lambda_i v_i$ for all $i \in [k]$. Therefore each
$(v_i, \lambda_i)$ is an eigenvector/eigenvalue pair. Thus, we can find robust eigenvectors through a simple
power iteration,

$$\theta \mapsto \frac{T(I, \theta, \theta)}{\|T(I, \theta, \theta)\|},$$

and it turns out that all the basis vectors $v_i$ are robust.

Thus, the PI presents guaranteed algorithms for learning latent variable models with low sample
and computational complexities (a low-order polynomial in the latent space dimensionality).
Additionally, a subtle perturbation analysis controls the perturbation in multiple deflation stages
of the power method. This can be seen as an analogue of Weyl’s and Wedin’s theorems for singular
value perturbation for matrices. Moreover, the proposed tensor power iteration algorithm is
efficient for large-scale implementation and can be implemented using extremely simple linear algebra
operations such as singular value decomposition and tensor power iterations.
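The power iteration with deflation described above can be sketched as follows. This is a minimal illustration on an exactly orthogonal toy tensor, assuming NumPy; the function name `tensor_power_method`, the number of random restarts, and the fixed iteration count are assumptions made for the example, not the settings analyzed in [4].

```python
import numpy as np


def tensor_power_method(T, n_components, n_restarts=10, n_iters=100, seed=0):
    """Extract eigenpairs of an orthogonally decomposable symmetric tensor T."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    k = T.shape[0]
    eigvals, eigvecs = [], []
    for _ in range(n_components):
        best_val, best_vec = -np.inf, None
        for _ in range(n_restarts):
            theta = rng.standard_normal(k)
            theta /= np.linalg.norm(theta)
            for _ in range(n_iters):
                # Power update: theta <- T(I, theta, theta) / ||T(I, theta, theta)||
                theta = np.einsum("ijl,j,l->i", T, theta, theta)
                theta /= np.linalg.norm(theta)
            val = np.einsum("ijl,i,j,l->", T, theta, theta, theta)
            if val > best_val:
                best_val, best_vec = val, theta
        eigvals.append(best_val)
        eigvecs.append(best_vec)
        # Deflation: subtract the recovered rank-one term before the next round.
        T = T - best_val * np.einsum("i,j,l->ijl", best_vec, best_vec, best_vec)
    return np.array(eigvals), np.array(eigvecs)


# Toy input (assumed): T = sum_i lam_i v_i^{(x)3} with orthonormal rows v_i of V.
V, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((3, 3)))
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum("i,ia,ib,ic->abc", lam, V, V, V)
vals, vecs = tensor_power_method(T, n_components=3)
print(np.sort(vals))
```

On noisy, estimated moments the recovered pairs are only approximate, which is where the perturbation analysis of the deflation stages matters; the sketch does not attempt to reproduce that analysis.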

Method of Moments for Learning Community Models in Social Networks: The PI has
also employed the method of moments for learning another class of latent variable models, viz.,
community models in social networks, and has conducted preliminary, ongoing work in [5]. A
community generally refers to a group of individuals with shared interests (e.g., music, sports) or
relationships (e.g., friends, co-workers). In [5], the PI considers a mixed membership model
which incorporates overlapping communities, i.e., an agent can be part of multiple communities,
which is realistic. The PI proposes a novel algorithm for learning these models, based on simple
edge counts and “3-star” counts (i.e., a star with three leaves). This is the first work to present
a guaranteed method for learning mixed membership community models. Moreover, the results
are tight and match the best known bounds (e.g., for spectral clustering) in the special case of
the stochastic block model, a well-studied model where individuals are present in only one
community. The PI’s research group has implemented these methods on graphics processing units
(GPU), which makes it tractable to learn communities in social networks very quickly [6].
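As a rough illustration of the statistics involved, edge counts and 3-star counts can be read off an adjacency matrix as below. This sketch assumes NumPy, and it assumes the nodes are split into disjoint sets `X`, `A`, `B`, `C`, with star centres in `X` and one leaf in each of the other sets; the partition, the normalization by `len(X)`, and the function name are illustrative assumptions rather than the exact construction in [5].

```python
import numpy as np


def edge_and_three_star_counts(G, X, A, B, C):
    """Edge counts from X to A and 3-star counts with centre in X, leaves in A, B, C."""
    G = np.asarray(G, dtype=float)
    G_XA, G_XB, G_XC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    edges = G_XA.sum(axis=0) / len(X)       # fraction of X adjacent to each a in A
    # stars[a, b, c] = fraction of centres x in X adjacent to a, b and c at once.
    stars = np.einsum("xa,xb,xc->abc", G_XA, G_XB, G_XC) / len(X)
    return edges, stars


# Tiny undirected graph: node 0 is joined to 2, 3, 4; node 1 is joined to 2 only.
G = np.zeros((5, 5), dtype=int)
for i, j in [(0, 2), (0, 3), (0, 4), (1, 2)]:
    G[i, j] = G[j, i] = 1
edges, stars = edge_and_three_star_counts(G, X=[0, 1], A=[2], B=[3], C=[4])
print(edges, stars.ravel())
```

The 3-star array plays the role of a third-order moment tensor, so it can in principle be passed through the same whitening and power-iteration steps used for topic models.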

2  Summary  of  Results:  Learning  Graph-Based  Models 

Probabilistic Graphical Models: One graphical framework for representing high-dimensional
data is that of probabilistic graphical models, also known as Markov random fields or Markov
networks. A Markov network represents complex relationships between data at different nodes in
the form of a graph, known as the dependency graph [7-10]. Mathematically, any two sets of nodes
A and B are conditionally independent, conditioned on the separator set S, as shown in the figure.
Hence, the data at each node is influenced mainly by its neighbors in the dependency graph. A Markov
representation is succinct, with a much smaller number of parameters than the number of data dimensions
(variables), and at the same time, it explicitly encodes the relationships between the variables.
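The separation property can be made concrete with a small Gaussian example. The sketch below assumes NumPy and a three-node Gaussian Markov chain A - S - B, which is an illustrative choice and not a model taken from the report; in the Gaussian case, a zero entry of the precision (inverse covariance) matrix corresponds to a missing edge in the dependency graph.

```python
import numpy as np

# Assumed example: a Gaussian Markov chain A - S - B (nodes 0 - 1 - 2).
# Zeros in the precision matrix J mark missing edges of the dependency graph.
J = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
Sigma = np.linalg.inv(J)                  # covariance matrix


def conditional_covariance(Sigma, keep, given):
    """Covariance of the `keep` variables conditioned on the `given` variables."""
    S_kk = Sigma[np.ix_(keep, keep)]
    S_kg = Sigma[np.ix_(keep, given)]
    S_gg = Sigma[np.ix_(given, given)]
    # Schur complement of the conditioning block.
    return S_kk - S_kg @ np.linalg.solve(S_gg, S_kg.T)


marginal_AB = Sigma[0, 2]                                            # A and B are correlated
conditional_AB = conditional_covariance(Sigma, [0, 2], [1])[0, 1]    # given S, they are not
print(marginal_AB, conditional_AB)
```

Here A and B are marginally correlated, yet their conditional covariance given the separator S vanishes, which is the conditional independence statement encoded by the graph.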

Formulation of Learning from Data: Given $n$ i.i.d. data samples $x^n := [x(1), x(2), \ldots, x(n)]^\top$
from a graphical model $P$ with Markov graph $G$, the goal is to estimate the underlying graph.
The PI proposes methods and provides consistency guarantees for graph estimation in the
high-dimensional regime.

Structure Learning with Hidden Variables: Developing tractable methods to discover
hidden nodes and the overall graph structure(s) (and parameters) was an important goal of this
project. The PI has developed efficient methods for learning latent variable models in a variety of
settings. This includes the development of novel methods for learning hidden tree models [11,12],
which are especially relevant in phylogenetics [13]. Phylogenetics involves the estimation of the
evolutionary tree process which resulted in the present-day species. The developed algorithms have
low sample complexity and are much faster and more robust than the state of the art. The algorithm,
at a high level, maintains a tree model in each iteration and adds hidden variables by conducting
local tests. This property is unique to our approach and makes it well suited to real data, since
we can trade off model complexity and data fitting in a principled and efficient
manner. The PI has extended these methods to learning latent loopy models with long cycles [14],
and has demonstrated their effectiveness in financial and topic modeling.

Bayesian Networks with Latent Variables: In addition to incorporating latent variables,
it is important to model the complex dependencies among the variables. In [15], the PI provides
novel methods for learning directed acyclic graphs (DAG) with hidden variables. The method is
based on the intuition that learning is tractable when there is sufficient expansion in the DAG from
hidden to observed variables (e.g., when it is a latent tree or has a small number of colliders, i.e.,
nodes with multiple parents). This work combines sparse dictionary learning with the method of moments
in a novel manner and is the first work to provide guaranteed learning for latent Bayesian networks.
This has implications in many practical settings, e.g., for learning correlated topic models.

Modeling Using Multiple Graphs: Modeling high-dimensional data involves a delicate trade-off
between faithful representation and parsimony. Models which are sparse in some domain achieve
a parsimonious representation but may poorly fit the given data. The PI has developed frameworks
for relaxing the sparsity constraints without sacrificing parsimony in high dimensions. One
framework involves incorporating hidden factors which can change the structural (and parametric)
relationships among the observed variables [16], thereby resulting in a mixture of probabilistic
graphical models. The PI has developed methods with guaranteed recovery of mixture components
which are also efficient for practical implementation. The PI has also considered another approach
for modeling with multiple graphs. In [17], the observed data is fitted to a combination of a sparse
graphical model and a sparse independence model, thereby incorporating different kinds of
statistical relationships among the variables. The PI has developed novel decomposition methods based
on convex relaxation with guaranteed recovery in both the domains.

The algorithms developed above have been applied by the PI to a number of practical problems,
including financial and document modeling [11], object recognition in computer vision [18],
tracking the evolution of dynamic social networks [19], and modeling gene associations [20]. The PI’s
approaches have shown a huge improvement over previous ones in all these instances.

3  Significance  and  Impact  of  Conducted  Research 

Impact on the theory of high-dimensional learning: The PI’s recent contributions lie at the
forefront  of  innovation  in  big  data  and  high-dimensional  machine  learning.  She  has  provided  a  new 
theoretical  understanding  of  tractable  models  and  regimes  for  high-dimensional  learning,  developed 
novel  approaches  for  handling  massive  scale  data  and  also  analyzed  fundamental  limits  on  learning. 
Her  work  has  direct  implications  to  the  areas  of  machine  learning,  statistics  and  algorithms,  as  well 
as  to  a  number  of  applications  such  as  social  network  analysis,  document  categorization,  computer 
vision,  recommendation  systems,  and  computational  biology. 

The approaches employed by the PI involve a cross-pollination of tools and techniques from
machine learning, statistics, signal processing, information theory, optimization, random graph models,
and social sciences. In particular, her work brings together techniques from machine learning and
statistics (e.g., probabilistic graphical models, mixture models), information theory (fundamental
information limits), signal processing (e.g., independent component analysis), optimization (e.g.,
convex relaxation techniques and tensor algebra), statistical physics (e.g., phase transitions) and social
sciences (e.g., community formation models). This cross-disciplinary fusion of methods allows the
problem of data deluge to be tackled in ways far more effective than any individual approach.

Another significant contribution by the PI is to the area of learning latent variable models.
It is widely recognized that incorporating latent or hidden variables is a crucial aspect of
modeling. Latent variables can provide a succinct representation of the observed data through
dimensionality reduction; the possibly many observed variables are summarized by fewer hidden
effects. Further, they are central to predicting causal relationships and interpreting the hidden
effects as unobservable concepts. For instance, in sociology, human behavior is affected by abstract
notions such as social attitudes, beliefs, goals and plans. As another example, medical knowledge is
organized into causal hierarchies of invading organisms, physical disorders, pathological states and
symptoms, and only the symptoms are observed. However, learning general latent variable models
is challenging (in fact, it is NP-hard). Previous methods with theoretical consistency guarantees
have high computational and sample complexity, which typically scale exponentially with the latent
space dimensionality. The current practice for estimating latent variable models is mostly through
local search heuristics (e.g., the EM algorithm), which are prone to failure in high dimensions.

The  PI  has  been  able  to  circumvent  the  above  challenges  and  she  has  developed  novel  scalable 
approaches  for  learning  a  wide  class  of  latent  variable  models,  which  are  guaranteed  to  succeed,  and 
require  only  polynomial  sample  and  computational  complexity  [4,5,15,21,22].  The  PI  has  been  able 
to  achieve  these  impressive  results  by  invoking  the  underlying  tensor  algebra  in  many  popular 
latent  variable  models  such  as  Gaussian  mixture,  latent  Dirichlet  allocation  and  hidden  Markov 
models.  These  models  are  relevant  in  a  number  of  applications  including  document  modeling, 
natural  language  processing,  as  well  as  detecting  overlapping  communities  in  social  networks. 

In particular, the PI’s work on community detection [23] is the first guaranteed approach for
learning mixed membership community models, which are highly relevant for modeling online
social networks. Community detection is a classical problem studied in theoretical computer
science, statistics and sociology (see [23] for a survey). Previous theoretical guarantees for
community detection were mostly limited to the setting where each node belongs to a single community
(popularly known as the stochastic block model). In contrast, the PI’s innovative approach
provides guaranteed recovery of hidden communities for a wide class of models where communities can
overlap, and also provides tight guarantees for the special case of the stochastic block model. Thus,
the PI’s work significantly advances the state of the art on community detection in social networks.

Impact on applications of high-dimensional learning: Research involving learning from
high-dimensional data has widespread application. The PI has been actively involved in
transforming her theoretical results into practical algorithms in several domains. For instance, her
algorithms have been applied to text modeling [11,14], to automatically categorize words into (local)
hierarchies of topics. They have been applied to object recognition in computer vision [18], where
robust detection is achieved by exploiting the contextual information in natural images using
co-occurrence of objects. Another important application is to model the co-evolution of vertices and
edges in dynamic social networks [19]. Recently, the PI has been collaborating with domain experts
to apply the developed algorithms to modeling gene associations and predicting relationships
between regulators and genes [20]. The PI’s research group has implemented tensor decomposition
algorithms on graphics processing units (GPU), and can detect overlapping communities in
large graphs efficiently [6]. In all these instances, the PI’s approaches have shown a huge
improvement in performance over previous ones. Thus, the PI has made great strides in pushing the
boundaries of large-scale machine learning, on both theoretical and practical fronts.